
    Running deep learning applications on resource constrained devices

    The high accuracy of Deep Neural Networks (DNNs) comes at the expense of high computational cost and memory requirements. During inference, the input data is often collected on edge devices, which are resource-constrained. The existing solutions for edge deployment include i) executing the entire DNN on the edge (EDGE-ONLY), ii) sending the input from the edge to the cloud, where the DNN is processed (CLOUD-ONLY), and iii) splitting the DNN to execute partially on the edge and partially on the cloud (SPLIT). The choice among EDGE-ONLY, CLOUD-ONLY, and SPLIT is determined by several operating constraints, such as device resources and network speed, and application constraints, such as latency and accuracy.

    The EDGE-ONLY approach requires compact DNNs with low compute and memory requirements. Thus, an emerging class of DNNs employs low-rank convolutions (LRCONVs), which reduce one or more dimensions compared to spatial convolutions (CONVs). Prior research in hardware accelerators has largely focused on CONVs. LRCONVs, such as depthwise and pointwise convolutions, exhibit lower arithmetic intensity and lower data reuse, and therefore result in low hardware utilization and high latency. In our first work, we systematically explore the design space of cross-layer dataflows to exploit data reuse across layers for emerging DNNs in EDGE-ONLY scenarios. We develop novel fine-grain cross-layer dataflows for LRCONVs that support partial loop dimension completion. Our tool, X-Layer, decouples the nested loops in a pipeline and combines them to create a common outer dataflow and several inner dataflows.

    The CLOUD-ONLY approach can suffer from high latency due to the cost of transmitting large inputs from the edge to the cloud, which is a problem especially for latency-critical applications. The SPLIT approach reduces latency compared to the CLOUD-ONLY approach. However, existing solutions only split the DNN in floating-point precision. Executing the DNN in floating-point precision on the edge device can occupy a large amount of memory and reduce the potential options for SPLIT solutions. In our second work, we expand and explore the search space of SPLIT solutions by jointly applying mixed-precision post-training quantization and DNN graph splitting. Our work, Auto-Split, finds a balance in the trade-off among model accuracy, edge device capacity, transmission cost, and overall latency.
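
    The claim that depthwise and pointwise convolutions have low arithmetic intensity can be illustrated with a back-of-the-envelope count of multiply-accumulates versus data elements moved. The sketch below is not taken from the thesis; the layer sizes and the element-level counting (ignoring any on-chip caching) are illustrative assumptions.

```python
# A minimal sketch illustrating why low-rank convolutions (LRCONVs) have lower
# arithmetic intensity than spatial convolutions. Work is counted in
# multiply-accumulates (MACs); data is counted as input activations + weights +
# output activations, in elements.

def conv_stats(h, w, c_in, c_out, k):
    """Standard spatial convolution (CONV), stride 1, 'same' padding."""
    macs = h * w * c_out * c_in * k * k
    data = h * w * c_in + k * k * c_in * c_out + h * w * c_out
    return macs, data

def depthwise_stats(h, w, c, k):
    """Depthwise convolution: one k x k filter per channel."""
    macs = h * w * c * k * k
    data = h * w * c + k * k * c + h * w * c
    return macs, data

def pointwise_stats(h, w, c_in, c_out):
    """Pointwise (1x1) convolution."""
    macs = h * w * c_in * c_out
    data = h * w * c_in + c_in * c_out + h * w * c_out
    return macs, data

def intensity(macs, data):
    return macs / data  # MACs per element moved

if __name__ == "__main__":
    h = w = 56
    c = 128
    k = 3
    for name, (macs, data) in {
        "CONV 3x3":      conv_stats(h, w, c, c, k),
        "depthwise 3x3": depthwise_stats(h, w, c, k),
        "pointwise 1x1": pointwise_stats(h, w, c, c),
    }.items():
        print(f"{name:14s}  MACs={macs:>12,}  elements={data:>10,}  "
              f"intensity={intensity(macs, data):6.1f}")
    # For these sizes the CONV has an intensity in the hundreds, while the
    # depthwise layer is ~4.5 and the pointwise layer ~60: LRCONVs reuse each
    # fetched element far less, which lowers accelerator utilization.
```

    Similarly, the SPLIT idea behind Auto-Split can be pictured as a search over cut points that trades off edge compute, transmission cost, and cloud compute. The sketch below is a simplified, hypothetical model, not the paper's algorithm: it assumes per-layer latency and size estimates are available and abstracts away the joint per-layer bit-width search that Auto-Split performs; all names are illustrative.

```python
# A simplified, hypothetical sketch of choosing a DNN split point: run the
# first `split` layers on the edge, ship the intermediate activation to the
# cloud, and run the remaining layers there.

from dataclasses import dataclass

@dataclass
class Layer:
    edge_latency_ms: float     # estimated latency of this layer on the edge
    cloud_latency_ms: float    # estimated latency of this layer on the cloud
    weight_bytes: int          # memory the layer's weights occupy on the edge
    out_activation_bytes: int  # size of the activation this layer produces

def end_to_end_latency(layers, split, uplink_bytes_per_ms, input_bytes):
    """Latency of running layers[:split] on the edge and the rest on the cloud."""
    edge = sum(l.edge_latency_ms for l in layers[:split])
    cloud = sum(l.cloud_latency_ms for l in layers[split:])
    # Data sent over the network: the raw input for CLOUD-ONLY (split == 0),
    # nothing for EDGE-ONLY (split == len(layers)), otherwise the activation
    # produced by the last edge-resident layer.
    if split == 0:
        tx_bytes = input_bytes
    elif split == len(layers):
        tx_bytes = 0
    else:
        tx_bytes = layers[split - 1].out_activation_bytes
    return edge + tx_bytes / uplink_bytes_per_ms + cloud

def best_split(layers, uplink_bytes_per_ms, input_bytes, edge_mem_budget):
    """Pick the lowest-latency split whose edge-resident weights fit the budget."""
    candidates = []
    for split in range(len(layers) + 1):
        edge_mem = sum(l.weight_bytes for l in layers[:split])
        if edge_mem <= edge_mem_budget:
            candidates.append(
                (end_to_end_latency(layers, split, uplink_bytes_per_ms,
                                    input_bytes), split))
    # split == 0 always fits, so candidates is never empty.
    return min(candidates)  # (latency_ms, split_index)
```

    In this toy model, split = 0 corresponds to CLOUD-ONLY and split = len(layers) to EDGE-ONLY, so the three deployment options fall out of the same search; quantization enters only indirectly, by shrinking the per-layer weight and activation sizes fed into the estimates.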

    Leveraging Compiler Alias Analysis To Free Accelerators from Load-Store Queues

    Hardware accelerators are an energy-efficient alternative to general-purpose processors for specific program regions. They have relied on the compiler to extract instruction-level parallelism, but may waste significant energy in memory disambiguation and in discovering memory-level parallelism (MLP). Currently, accelerators either i) define the problem away and rely on massively parallel programming models [1, 48] to extract MLP, ii) reuse the out-of-order (OoO) processor [7, 28] and rely on power-hungry load-store queues (LSQs) for memory disambiguation, or iii) serialize: some accelerators [47] focus on program regions where MLP is not important and simply serialize memory operations.

    We present NACHOS, a compiler-assisted, energy-efficient approach to memory disambiguation that completely eliminates the need for an LSQ. NACHOS classifies memory operations pairwise into those that do not alias (i.e., independent memory operations), must alias (i.e., ordering is required between the memory operations), and may alias (i.e., the compiler is unsure). To enforce program order between must-alias memory operations, the compiler inserts ordering edges that are enforced as def-use data dependencies. When the compiler is unsure (i.e., may alias) about a pair of memory operations, the hardware checks whether they are independent. We demonstrate that compiler alias analysis, with additional refinement, can achieve high accuracy for hardware-accelerated regions. In our workload suite comprising SPEC2K, SPEC2K6, and PARSEC workloads, NACHOS imposes no energy overhead over the functional units in 15 applications (i.e., the compiler resolves all dependencies), and in another 12 applications NACHOS consumes ≈17% of functional unit energy (max: 53% in povray). Overall, NACHOS achieves performance similar to an optimized LSQ, which adds an energy overhead equal to 2.3× of the compute energy.
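
    The pairwise no/must/may classification and its division of labor between compiler and hardware can be sketched as follows. This is a toy illustration over an assumed tuple-style IR, not NACHOS's implementation: must-alias pairs become compiler-inserted ordering edges, no-alias pairs need nothing, and may-alias pairs are deferred to a runtime (hardware) address check.

```python
# A toy illustration (not NACHOS itself) of pairwise memory disambiguation:
# classify each pair of memory operations as NO, MUST, or MAY alias, insert
# ordering edges for MUST pairs, and leave MAY pairs for a hardware check.

from enum import Enum
from itertools import combinations

class Alias(Enum):
    NO = "no"      # provably independent: no ordering needed
    MUST = "must"  # provably the same location: enforce program order
    MAY = "may"    # compiler is unsure: check addresses at runtime

def classify(op_a, op_b):
    """Stand-in for compiler alias analysis over a toy IR where each memory
    operation is (kind, base, offset) with compile-time-constant offsets."""
    _, base_a, off_a = op_a
    _, base_b, off_b = op_b
    if base_a != base_b:
        return Alias.MAY          # different bases might still overlap
    return Alias.MUST if off_a == off_b else Alias.NO

def ordering_edges(ops):
    """Edges the compiler would insert (as def-use dependencies) plus the
    pairs left for the hardware to check at runtime."""
    edges, hw_checks = [], []
    for (i, a), (j, b) in combinations(enumerate(ops), 2):
        if a[0] == "load" and b[0] == "load":
            continue              # only pairs with a store can conflict
        verdict = classify(a, b)
        if verdict is Alias.MUST:
            edges.append((i, j))      # enforce program order i -> j
        elif verdict is Alias.MAY:
            hw_checks.append((i, j))  # hardware compares addresses
    return edges, hw_checks

ops = [("store", "A", 0), ("load", "A", 0), ("load", "A", 4), ("store", "B", 0)]
print(ordering_edges(ops))
# -> ([(0, 1)], [(0, 3), (1, 3), (2, 3)])
```

    The printed result shows one ordering edge, between the store and load to the same address, while three pairs are deferred to the hardware check because the two base addresses might overlap; the remaining pairs are provably independent and need no mechanism at all.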